Abstract


We introduce a neural network with a recurrent attention model over a possibly
large external memory. The architecture is a form of Memory Network [23],
but unlike the model in that work, it is trained end-to-end, and hence requires
significantly less supervision during training, making it more generally applicable
in realistic settings. It can also be seen as an extension of RNNsearch [2] to the
case where multiple computational steps (hops) are performed per output symbol.

The flexibility of the model allows us to apply it to tasks as diverse as (synthetic) 
question answering and to language modeling. For the former our approach 
is competitive with Memory Networks, but with less supervision. For the latter, 
on the Penn TreeBank and Text8 datasets our approach demonstrates comparable
performance to RNNs and LSTMs. In both cases we show that the key concept 
of multiple computational hops yields improved results. 

1 Introduction 

Two grand challenges in artificial intelligence research have been to build models that can make 
multiple computational steps in the service of answering a question or completing a task, and 
models that can describe long term dependencies in sequential data. 

Recently there has been a resurgence in models of computation using explicit storage and a notion
of attention [23, 8, 2]; manipulating such a storage offers an approach to both of these challenges.
In [23, 8, 2], the storage is endowed with a continuous representation; reads from and writes to the
storage, as well as other processing steps, are modeled by the actions of neural networks.

In this work, we present a novel recurrent neural network (RNN) architecture where the recurrence
reads from a possibly large external memory multiple times before outputting a symbol. Our model
can be considered a continuous form of the Memory Network implemented in [23]. The model in
that work was not easy to train via backpropagation, and required supervision at each layer of the
network. The continuity of the model we present here means that it can be trained end-to-end from
input-output pairs, and so is applicable to more tasks, i.e. tasks where such supervision is not
available, such as in language modeling or realistically supervised question answering tasks. Our model
can also be seen as a version of RNNsearch [2] with multiple computational steps (which we term
“hops”) per output symbol. We will show experimentally that the multiple hops over the long-term
memory are crucial to good performance of our model on these tasks, and that training the memory
representation can be integrated in a scalable manner into our end-to-end neural network model.

2 Approach 

Our model takes a discrete set of inputs x_1, ..., x_n that are to be stored in the memory, a query q, and
outputs an answer a. Each of the x_i, q, and a contains symbols coming from a dictionary with V
words. The model writes all x to the memory up to a fixed buffer size, and then finds a continuous
representation for the x and q. The continuous representation is then processed via multiple hops to
output a. This allows backpropagation of the error signal through multiple memory accesses back
to the input during training.
2.1 Single Layer

We start by describing our model in the single layer case, which implements a single memory hop 
operation. We then show it can be stacked to give multiple hops in memory. 

Input memory representation: Suppose we are given an input set x_1, ..., x_i to be stored in memory.
The entire set of {x_i} is converted into memory vectors {m_i} of dimension d computed by
embedding each x_i in a continuous space, in the simplest case, using an embedding matrix A (of
size d × V). The query q is also embedded (again, in the simplest case via another embedding matrix
B with the same dimensions as A) to obtain an internal state u. In the embedding space, we compute
the match between u and each memory m_i by taking the inner product followed by a softmax:

p_i = \mathrm{Softmax}(u^T m_i).    (1)

where \mathrm{Softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}. Defined in this way p is a probability vector over the inputs.

Output memory representation: Each x_i has a corresponding output vector c_i (given in the
simplest case by another embedding matrix C). The response vector from the memory o is then a
sum over the transformed inputs c_i, weighted by the probability vector from the input:

o = \sum_i p_i c_i.    (2)

Because the function from input to output is smooth, we can easily compute gradients and back- 
propagate through it. Other recently proposed forms of memory or attention take this approach, 
notably Bahdanau et al. m and Graves et al. El, see also a. 

Generating the final prediction: In the single layer case, the sum of the output vector o and the
input embedding u is then passed through a final weight matrix W (of size V × d) and a softmax
to produce the predicted label:

\hat{a} = \mathrm{Softmax}(W(o + u))    (3)


The overall model is shown in Fig. 1(a). During training, all three embedding matrices A, B and C,
as well as W are jointly learned by minimizing a standard cross-entropy loss between \hat{a} and the true
label a. Training is performed using stochastic gradient descent (see Section 4.2 for more details).
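
The single-hop computation of Eqs. (1)-(3) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names (`softmax`, `embed_bow`, `single_hop`) are hypothetical, matrices are stored as lists of rows, and each sentence is embedded by summing the embeddings of its words (the bag-of-words case), which is an assumption about what "the simplest case" means here.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def embed_bow(E, word_ids):
    """Sum the columns of a d x V matrix E selected by word_ids."""
    return [sum(row[w] for w in word_ids) for row in E]

def single_hop(A, B, C, W, sentences, query):
    """One memory hop. Returns (attention p, predicted distribution a_hat)."""
    m = [embed_bow(A, s) for s in sentences]      # input memories m_i
    c = [embed_bow(C, s) for s in sentences]      # output memories c_i
    u = embed_bow(B, query)                       # internal state u
    p = softmax([dot(u, m_i) for m_i in m])       # Eq. (1)
    o = [sum(p_i * c_i[k] for p_i, c_i in zip(p, c))
         for k in range(len(u))]                  # Eq. (2)
    s = [o_k + u_k for o_k, u_k in zip(o, u)]
    a_hat = softmax([dot(row, s) for row in W])   # Eq. (3), W is V x d
    return p, a_hat
```

Every step is a smooth function of A, B, C and W, which is why the error signal can be backpropagated through the memory access.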



Figure 1: (a): A single layer version of our model. (b): A three layer version of our model. In
practice, we can constrain several of the embedding matrices to be the same (see Section 2.2).


2.2 Multiple Layers 


We now extend our model to handle K hop operations. The memory layers are stacked in the 
following way: 


• The input to layers above the first is the sum of the output o^k and the input u^k from layer k
(different ways to combine o^k and u^k are proposed later):

u^{k+1} = u^k + o^k.    (4)

• Each layer has its own embedding matrices A^k, C^k, used to embed the inputs {x_i}. However, as
discussed below, they are constrained to ease training and reduce the number of parameters.

• At the top of the network, the input to W also combines the input and the output of the top
memory layer: \hat{a} = \mathrm{Softmax}(W u^{K+1}) = \mathrm{Softmax}(W(o^K + u^K)).

We explore two types of weight tying within the model: 

1. Adjacent: the output embedding for one layer is the input embedding for the one above,
i.e. A^{k+1} = C^k. We also constrain (a) the answer prediction matrix to be the same as the
final output embedding, i.e. W^T = C^K, and (b) the question embedding to match the input
embedding of the first layer, i.e. B = A^1.

2. Layer-wise (RNN-like): the input and output embeddings are the same across different
layers, i.e. A^1 = A^2 = ... = A^K and C^1 = C^2 = ... = C^K. We have found it useful to
add a linear mapping H to the update of u between hops; that is, u^{k+1} = H u^k + o^k. This
mapping is learnt along with the rest of the parameters and used throughout our experiments
for layer-wise weight tying.
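
As an illustration of the stacking, the following sketch applies K hops under layer-wise weight tying, so that the same memory vectors m_i and c_i are reused at every hop and the state is updated as u <- H u + o. The name `k_hops` is hypothetical, the memories are assumed to be already embedded, and applying H at every update including the last one is an assumption of this sketch.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def k_hops(m, c, u, H, K):
    """Run K memory hops with shared memories (layer-wise tying).

    m, c: lists of input/output memory vectors; u: initial state;
    H: d x d matrix as a list of rows. Returns (final state, attention per hop).
    """
    attention = []
    for _ in range(K):
        scores = [sum(a * b for a, b in zip(u, m_i)) for m_i in m]
        p = softmax(scores)                                   # attention at this hop
        o = [sum(p_i * c_i[k] for p_i, c_i in zip(p, c))
             for k in range(len(u))]                          # memory response
        Hu = [sum(h * x for h, x in zip(row, u)) for row in H]
        u = [a + b for a, b in zip(Hu, o)]                    # u <- H u + o
        attention.append(p)
    return u, attention
```

Setting H to the identity recovers the plain update of Eq. (4); the final state would then be passed through W and a softmax to predict the answer.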

A three-layer version of our memory model is shown in Fig. 1(b). Overall, it is similar to the
Memory Network model in [23], except that the hard max operations within each layer have been
replaced with a continuous weighting from the softmax.

Note that if we use the layer-wise weight tying scheme, our model can be cast as a traditional
RNN where we divide the outputs of the RNN into internal and external outputs. Emitting an
internal output corresponds to considering a memory, and emitting an external output corresponds
to predicting a label. From the RNN point of view, u in Fig. 1(b) and Eqn. 4 is a hidden state, and
the model generates an internal output p (attention weights in Fig. 1(a)) using A. The model then
ingests p using C, updates the hidden state, and so on¹. Here, unlike a standard RNN, we explicitly
condition on the outputs stored in memory during the K hops, and we keep these outputs soft,
rather than sampling them. Thus our model makes several computational steps before producing an
output meant to be seen by the “outside world”.

3 Related Work 


A number of recent efforts have explored ways to capture long-term structure within sequences 
using RNNs or LSTM-based models iiiiiiiiiiBiiioiii]. The memory in these models is the state
of the network, which is latent and inherently unstable over long timescales. The LSTM-based
models address this through local memory cells which lock in the network state from the past. In
practice, the performance gains over carefully trained RNNs are modest (see Mikolov et al. [15]).
Our model differs from these in that it uses a global memory, with shared read and write functions. 
However, with layer-wise weight tying our model can be viewed as a form of RNN which only 
produces an output after a fixed number of time steps (corresponding to the number of hops), with 
the intermediary steps involving memory input/output operations that update the internal state. 

Some of the very early work on neural networks by Steinbuch and Piske lfT^ and Taylor ETIl con¬ 
sidered a memory that performed nearest-neighbor operations on stored input vectors and then fit 
parametric models to the retrieved sets. This has similarities to a single layer version of our model. 

Subsequent work in the 1990’s explored other types of memory iniiiiiii. For example. Das 
et al. Q and Mozer et al. ifT^ introduced an explicit stack with push and pop operations which has 
been revisited recently by im in the context of an RNN model. 


Closely related to our model is the Neural Turing Machine of Graves et al. [8], which also uses
a continuous memory representation. The NTM memory uses both content and address-based 
access, unlike ours which only explicitly allows the former, although the temporal features that we 
will introduce in Section 4.1 allow a kind of address-based access. However, in part because we 
always write each memory sequentially, our model is somewhat simpler, not requiring operations 
like sharpening. Furthermore, we apply our memory model to textual reasoning tasks, which 
qualitatively differ from the more abstract operations of sorting and recall tackled by the NTM. 


¹Note that in this view, the terminology of input and output from Fig. 1 is flipped: when viewed as a
traditional RNN with this special conditioning of outputs, A becomes part of the output embedding of the
RNN and C becomes the input embedding.
Our model is also related to Bahdanau et al. [2]. In that work, a bidirectional RNN based encoder
and gated RNN based decoder were used for machine translation. The decoder uses an attention
model that finds which hidden states from the encoding are most useful for outputting the next
translated word; the attention model uses a small neural network that takes as input a concatenation
of the current hidden state of the decoder and each of the encoder's hidden states. A similar attention
model is also used in Xu et al. [24] for generating image captions. Our “memory” is analogous to
their attention mechanism, although [2] is only over a single sentence rather than many, as in our
case. Furthermore, our model makes several hops on the memory before making an output; we will
see below that this is important for good performance. There are also differences in the architecture
of the small network used to score the memories compared to our scoring approach; we use a simple
linear layer, whereas they use a more sophisticated gated architecture.

We will apply our model to language modeling, an extensively studied task. Goodman m showed 
simple but effective approaches which combine n-grams with a cache. Bengio et al. 12 ignited 
interest in using neural network based models for the task, with RNNs m and LSTMs HOjEa 
showing clear performance gains over traditional methods. Indeed, the current state-of-the-art is 
held by variants of these models, for example very large LSTMs with Dropout l25]l or RNNs with 
diagonal constraints on the weight matrix 1I2. With appropriate weight tying, our model can be 
regarded as a modified form of RNN, where the recurrence is indexed by memory lookups to the 
word sequence rather than indexed by the sequence itself. 

4 Synthetic Question and Answering Experiments 

We perform experiments on the synthetic QA tasks defined in [22] (using version 1.1 of the dataset).
A given QA task consists of a set of statements, followed by a question whose answer is typically 
a single word (in a few tasks, answers are a set of words). The answer is available to the model at 
training time, but must be predicted at test time. There are a total of 20 different types of tasks that 
probe different forms of reasoning and deduction. Here are samples of three of the tasks:

Sam walks into the kitchen.    | Brian is a lion.        | Mary journeyed to the den.
Sam picks up an apple.         | Julius is a lion.       | Mary went back to the kitchen.
Sam walks into the bedroom.    | Julius is white.        | John journeyed to the bedroom.
Sam drops the apple.           | Bernhard is green.      | Mary discarded the milk.
Q: Where is the apple?         | Q: What color is Brian? | Q: Where was the milk before the den?
A. Bedroom                     | A. White                | A. Hallway

Note that for each question, only some subset of the statements contain information needed for 
the answer, and the others are essentially irrelevant distractors (e.g. the first sentence in the first 
example). In the Memory Networks of Weston et al. [22], this supporting subset was explicitly
indicated to the model during training and the key difference between that work and this one is that 
this information is no longer provided. Hence, the model must deduce for itself at training and test 
time which sentences are relevant and which are not. 

Formally, for one of the 20 QA tasks, we are given example problems, each having a set of I
sentences {x_i} where I < 320; a question sentence q and answer a. Let the jth word of sentence
i be x_ij, represented by a one-hot vector of length V (where the vocabulary is of size V = 177,
reflecting the simplistic nature of the QA language). The same representation is used for the 
question q and answer a. Two versions of the data are used, one that has 1000 training problems 
per task and a second larger one with 10,000 per task. 

4.1 Model Details 

Unless otherwise stated, all experiments used a K = 3 hops model with the adjacent weight sharing 
scheme. For all tasks that output lists (i.e. the answers are multiple words), we take each possible 
combination of possible outputs and record them as a separate answer vocabulary word. 

Sentence Representation: In our experiments we explore two different representations for 
the sentences. The first is the bag-of-words (BoW) representation that takes the sentence 
x_i = {x_i1, x_i2, ..., x_in}, embeds each word and sums the resulting vectors: e.g. m_i = \sum_j A x_ij and
c_i = \sum_j C x_ij. The input vector u representing the question is also embedded as a bag of words:
u = \sum_j B q_j. This has the drawback that it cannot capture the order of the words in the sentence,
which is important for some tasks.

We therefore propose a second representation that encodes the position of words within the
sentence. This takes the form: m_i = \sum_j l_j · A x_ij, where · is an element-wise multiplication. l_j is a
column vector with the structure l_kj = (1 − j/J) − (k/d)(1 − 2j/J) (assuming 1-based indexing),
with J being the number of words in the sentence, and d is the dimension of the embedding. This
sentence representation, which we call position encoding (PE), means that the order of the words
now affects m_i. The same representation is used for questions, memory inputs and memory outputs.
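
To make the position encoding concrete, here is a small sketch of the weights l_kj and the resulting sentence embedding. The function names `position_weights` and `embed_pe` are hypothetical; the embedding matrix E is stored as a list of d rows of length V, and the lists are 0-based while the formula itself uses 1-based j and k.

```python
def position_weights(J, d):
    """Return l[k][j] = (1 - j/J) - (k/d)(1 - 2j/J) for k = 1..d, j = 1..J.

    The returned nested list is indexed from 0, i.e. l[k-1][j-1].
    """
    return [[(1 - j / J) - (k / d) * (1 - 2 * j / J) for j in range(1, J + 1)]
            for k in range(1, d + 1)]

def embed_pe(E, word_ids):
    """Position-encoded sentence vector: sum_j l_j * (E x_j), element-wise."""
    d, J = len(E), len(word_ids)
    l = position_weights(J, d)
    return [sum(l[k][j] * E[k][w] for j, w in enumerate(word_ids))
            for k in range(d)]
```

With a toy matrix E = [[1, 2], [3, 4]], the two-word sentences [0, 1] and [1, 0] give [1.5, 5.5] and [1.5, 5.0], so swapping the words changes the embedding, which a plain bag-of-words sum would not do.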

Temporal Encoding: Many of the QA tasks require some notion of temporal context, i.e. in
the first example of Section 4, the model needs to understand that Sam is in the bedroom after
he is in the kitchen. To enable our model to address them, we modify the memory vector so
that m_i = \sum_j A x_ij + T_A(i), where T_A(i) is the ith row of a special matrix T_A that encodes
temporal information. The output embedding is augmented in the same way with a matrix T_C
(e.g. c_i = \sum_j C x_ij + T_C(i)). Both T_A and T_C are learned during training. They are also subject to
the same sharing constraints as A and C. Note that sentences are indexed in reverse order, reflecting
their relative distance from the question so that x_1 is the last sentence of the story.
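
The reverse indexing is easy to get wrong, so a short sketch may help. It assumes the sentence vectors arrive in story (chronological) order and that the temporal matrix is stored as a list of rows; the name `add_temporal` is hypothetical, and row 0 here plays the role of the row for the last sentence of the story.

```python
def add_temporal(memories, T):
    """Add a temporal row to each sentence vector.

    memories: sentence vectors in chronological order.
    T: rows of the temporal matrix; T[0] is paired with the most recent
       sentence, T[1] with the one before it, and so on.
    """
    n = len(memories)
    encoded = []
    for pos, vec in enumerate(memories):
        age = n - 1 - pos  # 0 for the last sentence of the story
        encoded.append([v + t for v, t in zip(vec, T[age])])
    return encoded
```

Because the index counts backwards from the question, a given row of the temporal matrix always means "this many sentences ago", regardless of story length.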

Learning time invariance by injecting random noise: we have found it helpful to add “dummy”
memories to regularize T_A. That is, at training time we can randomly add 10% of empty memories
to the stories. We refer to this approach as random noise (RN).
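
One possible reading of the random noise procedure is sketched below. The text does not spell out how the empty memories are placed, so the choice made here (insert an empty memory before each sentence with probability 0.1) is an assumption, as are the name `add_random_noise` and its parameters.

```python
import random

def add_random_noise(story, rate=0.1, rng=None):
    """Return a copy of the story with empty memories randomly inserted.

    story: list of sentences (each a non-empty list of word ids).
    rate:  probability of inserting an empty memory before each sentence.
    rng:   optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()
    noisy = []
    for sentence in story:
        if rng.random() < rate:
            noisy.append([])  # empty "dummy" memory
        noisy.append(sentence)
    return noisy
```

The inserted memories shift the time index of the real sentences, so the model cannot rely on exact temporal positions.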

4.2 Training Details 

10% of the bAbI training set was held-out to form a validation set, which was used to select the
optimal model architecture and hyperparameters. Our models were trained using a learning rate of
η = 0.01, with anneals every 25 epochs by η/2 until 100 epochs were reached. No momentum or
weight decay was used. The weights were initialized randomly from a Gaussian distribution with
zero mean and σ = 0.1. When trained on all tasks simultaneously with 1k training samples (10k
training samples), 60 epochs (20 epochs) were used with learning rate anneals of η/2 every 15
epochs (5 epochs). All training uses a batch size of 32 (but cost is not averaged over a batch), and
gradients with an ℓ2 norm larger than 40 are divided by a scalar to have norm 40. In some of our
experiments, we explored commencing training with the softmax in each memory layer removed,
making the model entirely linear except for the final softmax for answer prediction. When the
validation loss stopped decreasing, the softmax layers were re-inserted and training recommenced.
We refer to this as linear start (LS) training. In LS training, the initial learning rate is set to
η = 0.005. The capacity of memory is restricted to the most recent 50 sentences. Since the number
of sentences and the number of words per sentence varied between problems, a null symbol was
used to pad them all to a fixed size. The embedding of the null symbol was constrained to be zero.
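
The annealing schedule and the gradient rescaling described above can be written down directly. The sketch below is illustrative only: `learning_rate` and `clip_by_norm` are hypothetical names, epochs are counted from 0, and the gradient is treated as one flat list of floats.

```python
import math

def learning_rate(epoch, eta0=0.01, anneal_every=25):
    """Halve the learning rate every `anneal_every` epochs (epoch counted from 0)."""
    return eta0 / (2 ** (epoch // anneal_every))

def clip_by_norm(grad, max_norm=40.0):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)
```

Under these assumptions the rate is 0.01 for epochs 0-24, 0.005 for epochs 25-49, and 0.00125 by epochs 75-99; a gradient such as [30, 40] (norm 50) is scaled by 0.8 so that its norm becomes 40.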

On some tasks, we observed a large variance in the performance of our model (i.e. sometimes failing 
badly, other times not, depending on the initialization). To remedy this, we repeated each training 
10 times with different random initializations, and picked the one with the lowest training error. 

4.3 Baselines 

We compare our approach² (abbreviated to MemN2N) to a range of alternate models:

• MemNN: The strongly supervised AM+NG+NL Memory Networks approach, proposed in [22].
This is the best reported approach in that paper. It uses a max operation (rather than softmax) at 
each layer which is trained directly with supporting facts (strong supervision). It employs n-gram 
modeling, nonlinear layers and an adaptive number of hops per query. 

• MemNN-WSH: A weakly supervised heuristic version of MemNN where the supporting
sentence labels are not used in training. Since we are unable to backpropagate through the max
operations in each layer, we enforce that the first memory hop should share at least one word with
the question, and that the second memory hop should share at least one word with the first hop and
at least one word with the answer. All those memories that conform are called valid memories,
and the goal during training is to rank them higher than invalid memories using the same ranking
criteria as during strongly supervised training.

• LSTM: A standard LSTM model, trained using question / answer pairs only (i.e. also weakly 
supervised). For more detail, see Ea. 
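
The word-overlap rule used by the MemNN-WSH baseline can be illustrated with a short sketch. It is only a reading of the description above: the names `shares_word` and `valid_memory_pairs` are hypothetical, sentences are plain lists of tokens, no stop-word filtering is applied, and a sentence is allowed to serve as both hops (the text does not say whether this is excluded).

```python
def shares_word(a, b):
    """True if token lists a and b have at least one token in common."""
    return bool(set(a) & set(b))

def valid_memory_pairs(sentences, question, answer):
    """Return (first_hop, second_hop) index pairs that satisfy the heuristic.

    First hop: shares a word with the question.
    Second hop: shares a word with the first hop and with the answer.
    """
    pairs = []
    for i, first in enumerate(sentences):
        if not shares_word(first, question):
            continue
        for j, second in enumerate(sentences):
            if shares_word(second, first) and shares_word(second, answer):
                pairs.append((i, j))
    return pairs
```

For the story "john took milk", "john went hallway", "mary went kitchen" with question "where is milk" and answer "hallway", the only valid pair under these assumptions is (0, 1).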


²MemN2N source code is available at https://github.com/facebook/MemNN
4.4 Results


We report a variety of design choices: (i) BoW vs Position Encoding (PE) sentence representation;
(ii) training on all 20 tasks independently vs jointly training (joint training used an embedding
dimension of d = 50, while independent training used d = 20); (iii) two phase training: linear start
(LS) where softmaxes are removed initially vs training with softmaxes from the start; (iv) varying
memory hops from 1 to 3.

The results across all 20 tasks are given in Table 1 for the 1k training set, along with the mean
performance for the 10k training set³. They show a number of interesting points:

• The best MemN2N models are reasonably close to the supervised models (e.g. 1k: 6.7% for
MemNN vs 12.6% for MemN2N with position encoding + linear start + random noise, jointly
trained, and 10k: 3.2% for MemNN vs 4.2% for MemN2N with position encoding + linear start +
random noise + non-linearity⁴), although the supervised models are still superior.

• All variants of our proposed model comfortably beat the weakly supervised baseline methods.

• The position encoding (PE) representation improves over bag-of-words (BoW), as demonstrated
by clear improvements on tasks 4, 5, 15 and 18, where word ordering is particularly important.

• The linear start (LS) to training seems to help avoid local minima. See task 16 in Table 1, where
PE alone gets 53.6% error, while using LS reduces it to 1.6%.

• Jittering the time index with random empty memories (RN) as described in Section 4.1 gives a
small but consistent boost in performance, especially for the smaller 1k training set.

• Joint training on all tasks helps.

• Importantly, more computational hops give improved performance. We give examples of
the hops performed (via the values of eq. (1)) over some illustrative examples in Fig. 2 and in
Appendix B.




                                   Baseline                                            MemN2N
                             Strongly                                              PE   1 hop  2 hops  3 hops      PE   PE LS
                           Supervised    LSTM   MemNN                      PE      LS   PE LS   PE LS   PE LS   LS RN      LW
Task                       MemNN [22]    [22]     WSH     BoW      PE      LS      RN   joint   joint   joint   joint   joint
1: 1 supporting fact              0.0    50.0     0.1     0.6     0.1     0.2     0.0     0.8     0.0     0.1     0.0     0.1
2: 2 supporting facts             0.0    80.0    42.8    17.6    21.6    12.8     8.3    62.0    15.6    14.0    11.4    18.8
3: 3 supporting facts             0.0    80.0    76.4    71.0    64.2    58.8    40.3    76.9    31.6    33.1    21.9    31.7
4: 2 argument relations           0.0    39.0    40.3    32.0     3.8    11.6     2.8    22.8     2.2     5.7    13.4    17.5
5: 3 argument relations           2.0    30.0    16.3    18.3    14.1    15.7    13.1    11.0    13.4    14.8    14.4    12.9
6: yes/no questions               0.0    52.0    51.0     8.7     7.9     8.7     7.6     7.2     2.3     3.3     2.8     2.0
7: counting                      15.0    51.0    36.1    23.5    21.6    20.3    17.3    15.9    25.4    17.9    18.3    10.1
8: lists/sets                     9.0    55.0    37.8    11.4    12.6    12.7    10.0    13.2    11.7    10.1     9.3     6.1
9: simple negation                0.0    36.0    35.9    21.1    23.3    17.0    13.2     5.1     2.0     3.1     1.9     1.5
10: indefinite knowledge          2.0    56.0    68.7    22.8    17.4    18.6    15.1    10.6     5.0     6.6     6.5     2.6
11: basic coreference             0.0    38.0    30.0     4.1     4.3     0.0     0.9     8.4     1.2     0.9     0.3     3.3
12: conjunction                   0.0    26.0    10.1     0.3     0.3     0.1     0.2     0.4     0.0     0.3     0.1     0.0
13: compound coreference          0.0     6.0    19.7    10.5     9.9     0.3     0.4     6.3     0.2     1.4     0.2     0.5
14: time reasoning                1.0    73.0    18.3     1.3     1.8     2.0     1.7    36.9     8.1     8.2     6.9     2.0
15: basic deduction               0.0    79.0    64.8    24.3     0.0     0.0     0.0    46.4     0.5     0.0     0.0     1.8
16: basic induction               0.0    77.0    50.5    52.0    52.1     1.6     1.3    47.4    51.3     3.5     2.7    51.0
17: positional reasoning         35.0    49.0    50.9    45.4    50.1    49.0    51.0    44.4    41.2    44.5    40.4    42.6
18: size reasoning                5.0    48.0    51.3    48.1    13.6    10.1    11.1     9.6    10.3     9.2     9.4     9.2
19: path finding                 64.0    92.0   100.0    89.7    87.4    85.6    82.8    90.7    89.9    90.2    88.0    90.6
20: agent's motivation            0.0     9.0     3.6     0.1     0.0     0.0     0.0     0.0     0.1     0.0     0.0     0.2
Mean error (%)                    6.7    51.3    40.2    25.1    20.3    16.3    13.9    25.8    15.6    13.3    12.4    15.2
Failed tasks (err. > 5%)            4      20      18      15      13      12      11      17      11      11      11      10

On 10k training data
Mean error (%)                    3.2    36.4    39.2    15.4     9.4     7.2     6.6    24.5    10.9     7.9     7.5    11.0
Failed tasks (err. > 5%)            2      16      17       9       6       4       4      16       7       6       6       6

Table 1: Test error rates (%) on the 20 QA tasks for models using 1k training examples (mean
test errors for 10k training examples are shown at the bottom). Key: BoW = bag-of-words
representation; PE = position encoding representation; LS = linear start training; RN = random
injection of time index noise; LW = RNN-style layer-wise weight tying (if not stated, adjacent
weight tying is used); joint = joint training on all tasks (as opposed to per-task training).


5 Language Modeling Experiments 

The goal in language modeling is to predict the next word in a text sequence given the previous 
words X. We now explain how our model can easily be applied to this task. 

³More detailed results for the 10k training set can be found in Appendix A.

⁴Following im we found adding more non-linearity solves tasks 17 and 19, see Appendix A.
Story (1: 1 supporting fact)              Support  Hop 1  Hop 2  Hop 3
Daniel went to the bathroom.                       0.00   0.00   0.03
Mary travelled to the hallway.                     0.00   0.00   0.00
John went to the bedroom.                          0.37   0.02   0.00
John travelled to the bathroom.           yes      0.60   0.98   0.96
Mary went to the office.                           0.01   0.00   0.00
Where is John? Answer: bathroom Prediction: bathroom

Story (16: basic induction)               Support  Hop 1  Hop 2  Hop 3
Brian is a frog.                          yes      0.00   0.98   0.00
Lily is gray.                                      0.07   0.00   0.00
Brian is yellow.                          yes      0.07   0.00   1.00
Julius is green.                                   0.06   0.00   0.00
Greg is a frog.                           yes      0.76   0.02   0.00
What color is Greg? Answer: yellow Prediction: yellow

Story (2: 2 supporting facts)             Support  Hop 1  Hop 2  Hop 3
John dropped the milk.                             0.06   0.00   0.00
John took the milk there.                 yes      0.88   1.00   0.00
Sandra went back to the bathroom.                  0.00   0.00   0.00
John moved to the hallway.                yes      0.00   0.00   1.00
Mary went back to the bedroom.                     0.00   0.00   0.00
Where is the milk? Answer: hallway Prediction: hallway

Story (18: size reasoning)                Support  Hop 1  Hop 2  Hop 3
The suitcase is bigger than the chest.    yes      0.00   0.88   0.00
The box is bigger than the chocolate.              0.04   0.05   0.10
The chest is bigger than the chocolate.   yes      0.17   0.07   0.90
The chest fits inside the container.               0.00   0.00   0.00
The chest fits inside the box.                     0.00   0.00   0.00
Does the suitcase fit in the chocolate? Answer: no Prediction: no


Figure 2: Example predictions on the QA tasks of [22]. We show the labeled supporting facts
(support) from the dataset which MemN2N does not use during training, and the probabilities p of 
each hop used by the model during inference. MemN2N successfully learns to focus on the correct 
supporting sentences. 


            Penn Treebank                                Text8
            # of     # of     memory   Valid.   Test     # of     # of     memory   Valid.   Test
Model       hidden   hops     size     perp.    perp.    hidden   hops     size     perp.    perp.
RNN [15]    300      -        -        133      129      500      -        -        -        184
LSTM [15]   100      -        -        120      115      500      -        -        122      154
SCRN [15]   100      -        -        120      115      500      -        -        -        161
MemN2N      150      2        100      128      121      500      2        100      152      187
            150      3        100      129      122      500      3        100      142      178
            150      4        100      127      120      500      4        100      129      162
            150      5        100      127      118      500      5        100      123      154
            150      6        100      122      115      500      6        100      124      155
            150      7        100      120      114      500      7        100      118      147
            150      6        25       125      118      500      6        25       131      163
            150      6        50       121      114      500      6        50       132      166
            150      6        75       122      114      500      6        75       126      158
            150      6        100      122      115      500      6        100      124      155
            150      6        125      120      112      500      6        125      125      157
            150      6        150      121      114      500      6        150      123      154
            150      7        200      118      111      -        -        -        -        -


Table 2: The perplexity on the test sets of Penn Treebank and Text8 corpora. Note that increasing
the number of memory hops improves performance.


[Figure 3 image: two panels, one row per hop (1 to 6); horizontal axis: memory position (20 to 100).]

Figure 3: Average activation weight of memory positions during 6 memory hops. White color
indicates where the model is attending during the hop. For clarity, each row is normalized to
have maximum value of 1. A model is trained on (left) Penn Treebank and (right) Text8 dataset.


We now operate on word level, as opposed to the sentence level. Thus the previous N words in the 
sequence (including the current) are embedded into memory separately. Each memory cell holds 
only a single word, so there is no need for the BoW or linear mapping representations used in the 
QA tasks. We employ the temporal embedding approach of Section 4.1.


Since there is no longer any question, q in Fig. 1 is fixed to a constant vector 0.1 (without
embedding). The output softmax predicts which word in the vocabulary (of size V) is next in the
sequence. A cross-entropy loss is used to train the model by backpropagating the error through multiple
memory layers, in the same manner as the QA tasks. To aid training, we apply ReLU operations to
half of the units in each layer. We use layer-wise (RNN-like) weight sharing, i.e. the query weights
of each layer are the same; the output weights of each layer are the same. As noted in Section 2.2,
this makes our architecture closely related to an RNN which is traditionally used for language
modeling tasks; however here the “sequence” over which the network is recurrent is not in the text,
but in the memory hops. Furthermore, the weight tying restricts the number of parameters in the
model, helping generalization for the deeper models which we find to be effective for this task. We
use two different datasets:

Penn Tree Bank [13]: This consists of 929k/73k/82k train/validation/test words, distributed over a
vocabulary of 10k words. The same preprocessing as [25] was used.

Text8 [15]: This is a pre-processed version of the first 100M characters, dumped from
Wikipedia. This is split into 93.3M/5.7M/1M character train/validation/test sets. All words occurring
less than 5 times are replaced with the <UNK> token, resulting in a vocabulary size of ~44k.
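
The rare-word replacement described for Text8 amounts to a frequency threshold. The sketch below shows one way to do it; the function name `replace_rare` is hypothetical, and the input is assumed to be an already tokenized list of words.

```python
from collections import Counter

def replace_rare(tokens, min_count=5, unk="<UNK>"):
    """Replace every token that occurs fewer than min_count times with unk."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else unk for t in tokens]
```

A word seen exactly 5 times is kept, while a word seen 4 times is mapped to <UNK>; the vocabulary is then the set of surviving words plus the <UNK> token.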

5.1 Training Details 

The training procedure we use is the same as the QA tasks, except for the following. For each
mini-batch update, the ℓ2 norm of the whole gradient of all parameters is measured⁵ and if larger
than L = 50, then it is scaled down to have norm L. This was crucial for good performance. We
use the learning rate annealing schedule from na, namely, if the validation cost has not decreased
after one epoch, then the learning rate is scaled down by a factor 1.5. Training terminates when the
learning rate drops below 10^{-5}, i.e. after 50 epochs or so. Weights are initialized using N(0, 0.05)
and batch size is set to 128. On the Penn Treebank dataset, we repeat each training 10 times with different
random initializations and pick the one with smallest validation cost. However, we have done only
a single training run on the Text8 dataset due to limited time constraints.
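
The validation-driven annealing rule can be sketched as follows. The names `anneal` and `decays_until_stop` are hypothetical, and the starting rate of 0.01 is an assumption carried over from the QA training details rather than something stated for language modeling.

```python
def anneal(lr, val_cost, best_val_cost, factor=1.5):
    """Scale lr down by `factor` if the validation cost did not decrease.

    Returns the (possibly reduced) learning rate and the best cost so far.
    """
    if val_cost >= best_val_cost:
        lr = lr / factor
    return lr, min(val_cost, best_val_cost)

def decays_until_stop(lr0=0.01, min_lr=1e-5, factor=1.5):
    """Count how many decays are needed before lr drops below min_lr."""
    n, lr = 0, lr0
    while lr >= min_lr:
        lr /= factor
        n += 1
    return n
```

Under these assumptions 18 decays are needed before training stops, since 1.5^17 is about 985 and 1.5^18 is about 1478; the total number of epochs is larger because epochs in which the validation cost improves do not trigger a decay.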

5.2 Results 

Table [^compares our model to RNN, LSTM and Structurally Constrained Recurrent Nets (SCRN) 
Ga baselines on the two benchmark datasets. Note that the baseline architectures were tuned in 
ca to give optimal perplexit}J^ Our MemN2N approach achieves lower perplexity on both datasets 
(111 vs 115 for RNN/SCRN on Penn and 147 vs 154 for LSTM on Text8). Note that MemN2N 
has ~1.5x more parameters than RNNs with the same number of hidden units, while LSTM has 
~4x more parameters. We also vary the number of hops and memory size of our MemN2N, 
showing the contribution of both to performance; note in particular that increasing the number of 
hops helps. In Fig. 3 we show how MemN2N operates on memory with multiple hops. It shows 
the average weight of the activation of each memory position over the test set. We can see that 
some hops concentrate only on recent words, while other hops have broader attention over all 
memory locations, which is consistent with the idea that successful language models consist of a 
smoothed n-gram model and a cache [15]. Interestingly, it seems that those two types of hops tend 
to alternate. Also note that unlike a traditional RNN, the cache does not decay exponentially: it 
has roughly the same average activation across the entire memory. This may be the source of the 
observed improvement in language modeling. 
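
The per-hop average activation discussed above amounts to averaging the attention distributions over test examples. A minimal sketch, assuming the attention weights are available as a nested list `attn[n][k][i]` (example n, hop k, memory position i); this layout, the function name `average_attention`, and the toy numbers are hypothetical.

```python
def average_attention(attn):
    """Average attention over examples.

    attn[n][k][i] is the weight on memory position i at hop k for example n.
    Returns avg[k][i], the mean weight per hop and memory position.
    """
    n = len(attn)
    hops, slots = len(attn[0]), len(attn[0][0])
    return [[sum(ex[k][i] for ex in attn) / n for i in range(slots)]
            for k in range(hops)]


# Two toy examples, 2 hops, 3 memory positions (values are made up).
attn = [
    [[0.8, 0.1, 0.1], [0.3, 0.3, 0.4]],
    [[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]],
]
avg = average_attention(attn)
for k, row in enumerate(avg, start=1):
    print(f"hop {k}:", [round(w, 2) for w in row])
```

Because each per-example distribution sums to one, each averaged row also sums to one, so a flat row indicates broad attention and a peaked row indicates focus on a few positions.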

6 Conclusions and Future Work 

In this work we showed that a neural network with an explicit memory and a recurrent attention 
mechanism for reading the memory can be successfully trained via backpropagation on diverse tasks 
from question answering to language modeling. Compared to the Memory Network implementation 
of [23] there is no supervision of supporting facts and so our model can be used in a wider range 
of settings. Our model approaches the performance of that model, and is significantly better 
than other baselines with the same level of supervision. On language modeling tasks, it slightly 
outperforms tuned RNNs and LSTMs of comparable complexity. On both tasks we can see that 
increasing the number of memory hops improves performance. 

However, there is still much to do. Our model is still unable to exactly match the performance of 
the memory networks trained with strong supervision, and both fail on several of the 1k QA tasks. 
Furthermore, smooth lookups may not scale well to the case where a larger memory is required. For 
these settings, we plan to explore multiscale notions of attention or hashing, as proposed in [23]. 
Appendix A Results on 10k QA dataset 



                          Baseline                MemN2N
                            Strongly                                        PE     PE  1 hop 2 hops 3 hops  PE LS  PE LS
                          Supervised         MemNN                   PE     LS     LS  PE LS  PE LS  PE LS     RN     LW
Task                           MemNN   LSTM    WSH    BoW     PE     LS     RN    RN*  joint  joint  joint  joint  joint
1: 1 supporting fact             0.0    0.0    0.1    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
2: 2 supporting facts            0.0   81.9   39.6    0.6    0.4    0.5    0.3    0.3   62.0    1.3    2.3    1.0    0.8
3: 3 supporting facts            0.0   83.1   79.5   17.8   12.6   15.0    9.3    2.1   80.0   15.8   14.0    6.8   18.3
4: 2 argument relations          0.0    0.2   36.6   31.8    0.0    0.0    0.0    0.0   21.4    0.0    0.0    0.0    0.0
5: 3 argument relations          0.3    1.2   21.1   14.2    0.8    0.6    0.8    0.8    8.7    7.2    7.5    6.1    0.8
6: yes/no questions              0.0   51.8   49.9    0.1    0.2    0.1    0.0    0.1    6.1    0.7    0.2    0.1    0.1
7: counting                      3.3   24.9   35.1   10.7    5.7    3.2    3.7    2.0   14.8   10.5    6.1    6.6    8.4
8: lists/sets                    1.0   34.1   42.7    1.4    2.4    2.2    0.8    0.9    8.9    4.7    4.0    2.7    1.4
9: simple negation               0.0   20.2   36.4    1.8    1.3    2.0    0.8    0.3    3.7    0.4    0.0    0.0    0.2
10: indefinite knowledge         0.0   30.1   76.0    1.9    1.7    3.3    2.4    0.0   10.3    0.6    0.4    0.5    0.0
11: basic coreference            0.0   10.3   25.3    0.0    0.0    0.0    0.0    0.1    8.3    0.0    0.0    0.0    0.4
12: conjunction                  0.0   23.4    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.1    0.0
13: compound coreference         0.0    6.1   12.3    0.0    0.1    0.0    0.0    0.0    5.6    0.0    0.0    0.0    0.0
14: time reasoning               0.0   81.0    8.7    0.0    0.2    0.0    0.0    0.1   30.9    0.2    0.2    0.0    1.7
15: basic deduction              0.0   78.7   68.8   12.5    0.0    0.0    0.0    0.0   42.6    0.0    0.0    0.2    0.0
16: basic induction              0.0   51.9   50.9   50.9   48.6    0.1    0.4   51.8   47.3   46.4    0.4    0.2   49.2
17: positional reasoning        24.6   50.1   51.1   47.4   40.3   41.1   40.7   18.6   40.0   39.7   41.7   41.8   40.0
18: size reasoning               2.1    6.8   45.8   41.3    7.4    8.6    6.7    5.3    9.2   10.1    8.6    8.0    8.4
19: path finding                31.9   90.3  100.0   75.4   66.6   66.7   66.5    2.3   91.0   80.8   73.3   75.7   89.5
20: agent’s motivation           0.0    2.1    4.1    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
Mean error (%)                   3.2   36.4   39.2   15.4    9.4    7.2    6.6    4.2   24.5   10.9    7.9    7.5   11.0
Failed tasks (err. > 5%)           2     16     17      9      6      4      4      3     16      7      6      6      6


Table 3: Test error rates (%) on the 20 bAbI QA tasks for models using 10k training examples. 
Key: BoW = bag-of-words representation; PE = position encoding representation; LS = linear start 
training; RN = random injection of time index noise; LW = RNN-style layer-wise weight tying (if 
not stated, adjacent weight tying is used); joint = joint training on all tasks (as opposed to per-task 
training); * = this is a larger model with non-linearity (embedding dimension is d = 100 and ReLU 
applied to the internal state after each hop. This was inspired by [17] and crucial for getting better 
performance on tasks 17 and 19). 
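
The two summary rows of Table 3 follow directly from the per-task errors. As a small worked check, the sketch below recomputes them for the MemN2N PE LS RN column (per-task training); the function name `summarize` is introduced here for illustration only.

```python
# Per-task test errors (%) for tasks 1..20, MemN2N "PE LS RN" column of Table 3.
pe_ls_rn = [0.0, 0.3, 9.3, 0.0, 0.8, 0.0, 3.7, 0.8, 0.8, 2.4,
            0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 40.7, 6.7, 66.5, 0.0]


def summarize(errors, threshold=5.0):
    """Return (mean error rounded to one decimal, number of tasks with error > threshold)."""
    mean_error = sum(errors) / len(errors)
    failed = sum(1 for e in errors if e > threshold)
    return round(mean_error, 1), failed


mean_error, failed = summarize(pe_ls_rn)
print(mean_error, failed)  # matches the table: 6.6 and 4
```

The failed tasks for this column are 3, 17, 18 and 19, the only entries above the 5% threshold.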
Appendix B Visualization of attention weights in QA problems 


Story (1: 1 supporting fact)                   Support  Hop 1   Hop 2   Hop 3
Daniel went to the bathroom.                            0.00    0.00    0.03
Mary travelled to the hallway.                          0.00    0.00    0.00
John went to the bedroom.                               0.37    0.02    0.00
John travelled to the bathroom.                yes      0.60    0.98    0.96
Mary went to the office.                                0.01    0.00    0.00
Sandra journeyed to the kitchen.                        0.01    0.00    0.00
Where is John? Answer: bathroom Prediction: bathroom




Story (3: 3 supporting facts)                  Support  Hop 1   Hop 2   Hop 3
John moved to the hallway.                              0.00    0.00    0.00
John grabbed the football.                     yes      0.00    1.00    0.00
John journeyed to the garden.                           0.35    0.00    0.00
Sandra moved to the hallway.                            0.00    0.00    0.00
John went back to the hallway.                 yes      0.00    0.00    1.00
John journeyed to the garden.                  yes      0.62    0.00    0.00
Where was the football before the garden? Answer: hallway Prediction: hallway




Story (5: 3 argument relations)                Support  Hop 1   Hop 2   Hop 3
Jeff travelled to the bedroom.                          0.00    0.00    0.00
Jeff journeyed to the garden.                           0.00    0.00    0.00
Fred handed the apple to Jeff.                 yes      1.00    1.00    0.98
Mary went to the garden.                                0.00    0.00    0.00
Fred went back to the bathroom.                         0.00    0.00    0.00
Fred got the milk there.                                0.00    0.00    0.00
Mary journeyed to the kitchen.                          0.00    0.00    0.00
Who gave the apple to Jeff? Answer: Fred Prediction: Fred


Story (7: counting)                            Support  Hop 1   Hop 2   Hop 3
Daniel moved to the office.                             0.00    0.00    0.00
Mary moved to the office.                               0.00    0.00    0.00
Sandra picked up the apple there.              yes      0.14    0.00    0.92
Sandra dropped the apple.                      yes      0.12    0.00    0.00
Sandra took the apple there.                   yes      0.73    1.00    0.08
John went to the bedroom.                               0.00    0.00    0.00
How many objects is Sandra carrying? Answer: one Prediction: one


Story (9: simple negation)                     Support  Hop 1   Hop 2   Hop 3
Sandra is in the garden.                                0.60    0.99    0.00
Sandra is not in the garden.                   yes      0.37    0.01    1.00
John went to the office.                                0.00    0.00    0.00
John is in the bedroom.                                 0.00    0.00    0.00
Daniel moved to the garden.                             0.00    0.00    0.00
Is Sandra in the garden? Answer: no Prediction: no





Story (11: basic coreference)                  Support  Hop 1   Hop 2   Hop 3
Mary journeyed to the hallway.                          0.00    0.01    0.00
After that she journeyed to the bathroom.               0.00    0.00    0.00
Mary journeyed to the garden.                           0.00    0.00    0.00
Then she went to the office.                            0.01    0.06    0.00
Sandra journeyed to the garden.                yes      0.97    0.42    0.00
Then she went to the hallway.                  yes      0.00    0.50    1.00
Where is Sandra? Answer: hallway Prediction: hallway




Story (13: compound coreference)               Support  Hop 1   Hop 2   Hop 3
Sandra and Daniel travelled to the bathroom.            0.13    0.00    0.00
Afterwards they went back to the office.                0.01    0.00    0.00
Daniel and Mary travelled to the hallway.               0.01    0.00    0.00
Following that they went back to the office.            0.06    0.04    0.00
Mary and Sandra moved to the hallway.          yes      0.59    0.02    0.00
Then they went to the kitchen.                 yes      0.02    0.94    1.00
Where is Sandra? Answer: kitchen Prediction: kitchen


Story (15: basic deduction)                    Support  Hop 1   Hop 2   Hop 3
Cats are afraid of wolves.                     yes      0.00    0.99    0.62
Sheep are afraid of wolves.                             0.00    0.00    0.31
Winona is a sheep.                                      0.00    0.00    0.00
Emily is a sheep.                                       0.00    0.00    0.00
Gertrude is a cat.                             yes      0.99    0.00    0.00
Wolves are afraid of mice.                              0.00    0.00    0.00
Mice are afraid of wolves.                              0.00    0.00    0.07
Jessica is a mouse.                                     0.00    0.00    0.00
What is gertrude afraid of? Answer: wolf Prediction: wolf


Story (17: positional reasoning)               Support  Hop 1   Hop 2   Hop 3
The red square is below the red sphere.        yes      0.37    0.95    0.58
The red sphere is below the triangle.          yes      0.63    0.05    0.43
Is the triangle above the red square? Answer: yes Prediction: r




Story (2: 2 supporting facts)                  Support  Hop 1   Hop 2   Hop 3
John dropped the milk.                                  0.06    0.00    0.00
Daniel travelled to the bedroom.                        0.00    0.00    0.00
John took the milk there.                      yes      0.88    1.00    0.00
Sandra went back to the bathroom.                       0.00    0.00    0.00
John moved to the hallway.                     yes      0.00    0.00    1.00
Mary went back to the bedroom.                          0.00    0.00    0.00
Where is the milk? Answer: hallway Prediction: hallway


Story (4: 2 argument relations)                Support  Hop 1   Hop 2   Hop 3
The garden is north of the kitchen.            yes      0.84    1.00    0.92
The kitchen is north of the bedroom.                    0.16    0.00    0.08
What is north of the kitchen? Answer: garden Prediction: garden




Story (6: yes/no questions)                    Support  Hop 1   Hop 2   Hop 3
Sandra travelled to the bedroom.                        0.06    0.00    0.01
John took the football there.                           0.00    0.00    0.00
Sandra travelled to the office.                         0.00    0.45    0.16
Sandra went to the bedroom.                    yes      0.89    0.39    0.04
Daniel went back to the kitchen.                        0.00    0.16    0.00
John took the apple there.                              0.00    0.00    0.00
Mary got the milk there.                                0.00    0.00    0.00
Is Sandra in the bedroom? Answer: yes Prediction: yes


Story (8: lists/sets)                          Support  Hop 1   Hop 2   Hop 3
John moved to the hallway.                              0.00    0.00    0.00
John journeyed to the garden.                           0.00    0.00    0.00
Daniel moved to the garden.                             0.00    0.01    0.00
Daniel grabbed the apple there.                yes      0.03    0.00    0.98
Daniel got the milk there.                     yes      0.97    0.02    0.00
John went back to the hallway.                          0.00    0.00    0.00
What is Daniel carrying? Answer: apple,milk Prediction: apple,milk




Story (10: indefinite knowledge)               Support  Hop 1   Hop 2   Hop 3
Julie is either in the school or the bedroom.           0.00    0.00    0.00
Julie is either in the cinema or the park.              0.00    0.00    0.00
Bill is in the park.                                    0.00    0.00    0.00
Bill is either in the office or the office.    yes      1.00    1.00    1.00
Is Bill in the office? Answer: maybe Prediction: maybe


Story (12: conjunction)                        Support  Hop 1   Hop 2   Hop 3
John and Sandra went back to the kitchen.               0.08    0.00    0.00
Sandra and Mary travelled to the garden.                0.05    0.00    0.00
Mary and Daniel travelled to the office.                0.00    0.00    0.00
Mary and John went to the bathroom.                     0.01    0.00    0.00
Daniel and Sandra went to the kitchen.         yes      0.74    1.00    1.00
Daniel and Mary journeyed to the office.                0.06    0.00    0.00
Where is Sandra? Answer: kitchen Prediction: kitchen


Story (14: time reasoning)                     Support  Hop 1   Hop 2   Hop 3
This morning Julie went to the cinema.                  0.00    0.03    0.00
Julie journeyed to the kitchen yesterday.               0.00    0.04    0.01
Fred travelled to the cinema yesterday.                 0.00    0.05    0.01
Bill travelled to the office yesterday.                 0.00    0.07    0.01
This morning Mary travelled to the bedroom.    yes      0.97    0.27    0.01
Yesterday Mary journeyed to the cinema.        yes      0.01    0.33    0.96
Where was Mary before the bedroom? Answer: cinema Prediction: cinema


Story (16: basic induction)                    Support  Hop 1   Hop 2   Hop 3
Lily is a swan.                                         0.00    0.00    0.00
Brian is a frog.                               yes      0.00    0.98    0.00
Lily is gray.                                           0.07    0.00    0.00
Brian is yellow.                               yes      0.07    0.00    1.00
Julius is a swan.                                       0.00    0.00    0.00
Bernhard is yellow.                                     0.04    0.00    0.00
Julius is green.                                        0.06    0.00    0.00
Greg is a frog.                                yes      0.76    0.02    0.00
What color is Greg? Answer: yellow Prediction: yellow


Story (18: size reasoning)                     Support  Hop 1   Hop 2   Hop 3
The suitcase is bigger than the chest.         yes      0.00    0.88    0.00
The box is bigger than the chocolate.                   0.04    0.05    0.10
The chest is bigger than the chocolate.        yes      0.17    0.07    0.90
The chest fits inside the container.                    0.00    0.00    0.00
The chest fits inside the box.                          0.00    0.00    0.00
Does the suitcase fit in the chocolate? Answer: no Prediction: no




Story (20: agent's motivation)                 Support  Hop 1   Hop 2   Hop 3
Yann journeyed to the kitchen.                          0.00    0.00    0.00
Yann grabbed the apple there.                           0.00    0.00    0.00
Antoine is thirsty.                            yes      0.17    0.00    0.98
Jason picked up the milk there.                         0.01    0.00    0.00
Antoine travelled to the kitchen.                       0.77    1.00    0.00
Why did antoine go to the kitchen? Answer: thirsty Prediction: thirsty



Story (19: path finding)                       Support  Hop 1   Hop 2   Hop 3
The hallway is north of the kitchen.                    1.00    1.00    1.00
The garden is south of the kitchen.            yes      0.00    0.00    0.00
The garden is east of the bedroom.             yes      0.00    0.00    0.00
The bathroom is south of the bedroom.                   0.00    0.00    0.00
The office is east of the garden.                       0.00    0.00    0.00
How do you go from the kitchen to the bedroom? Answer: s,w Prediction: n,n


Figure 4: Examples of attention weights during different memory hops for the bAbI tasks. The 
model is PE+LS+RN with 3 memory hops that is trained separately on each task with 10k training 
data. The support column shows which sentences are necessary for answering questions. Although 
this information is not used, the model successfully learns to focus on the correct support sentences 
on most of the tasks. The hop columns show where the model put more weight (indicated by values 
and blue color) during its three hops. The mistakes made by the model are highlighted by red color. 
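
To make the hop columns concrete, the following is a minimal sketch of how one attention distribution per hop can arise from a soft memory read: a softmax over inner products between the controller state u and the input memories, followed by adding the weighted sum of output memories to u. The toy vectors, the function names, reusing the same memory embeddings at every hop, and omitting any linear map on u between hops are simplifying assumptions for illustration, not the trained PE+LS+RN model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hop_attention(u, in_mem, out_mem, hops=3):
    """Return one attention distribution over memory slots per hop."""
    weights = []
    for _ in range(hops):
        # Attention over memories: softmax of inner products with the state u.
        p = softmax([dot(u, m) for m in in_mem])
        # Read vector: attention-weighted sum of the output memories.
        o = [sum(p_i * c[d] for p_i, c in zip(p, out_mem))
             for d in range(len(u))]
        # State update for the next hop: add the read vector to u.
        u = [u_d + o_d for u_d, o_d in zip(u, o)]
        weights.append(p)
    return weights


# Hypothetical 2-d embeddings for a question and three memory sentences.
u0 = [1.0, 0.0]
in_mem = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
out_mem = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
weights = hop_attention(u0, in_mem, out_mem)
for k, p in enumerate(weights, start=1):
    print(f"Hop {k}:", [round(w, 2) for w in p])
```

Each printed row plays the role of one hop column in the figure: the entries are non-negative and sum to one across memory sentences.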